Formats over Time: Exploring UK Web History

Author

  • Andrew N. Jackson
Abstract

Is software obsolescence a significant risk? To explore this issue, we analysed a corpus of over 2.5 billion resources corresponding to the UK Web domain, as crawled between 1996 and 2010. Using the DROID and Apache Tika identification tools, we examined each resource and captured the results as extended MIME types, embedding version, software and hardware identifiers alongside the format information. The combined results form a detailed temporal format profile of the corpus, which we have made available as open data. We present the results of our initial analysis of this dataset. We look at image, HTML and PDF resources in some detail, showing how the usage of different formats, versions and software implementations has changed over time. Furthermore, we show that software obsolescence is rare on the web and uncover evidence indicating that network effects act to stabilise formats against obsolescence.
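As a rough illustration of the "extended MIME type" idea described in the abstract, the sketch below parses such a string into a base type plus parameters and tallies formats by year. The exact syntax used in the published dataset is not given here, so the `type/subtype; key="value"` layout, the parameter names `version` and `software`, the function names and the sample records are all assumptions made for illustration only.

```python
from collections import Counter


def parse_extended_mime(value):
    """Split an extended MIME type string into (base_type, params).

    Assumes a 'type/subtype; key="value"; ...' layout. This naive split
    does not handle semicolons inside quoted values.
    """
    parts = [p.strip() for p in value.split(";")]
    base = parts[0].lower()
    params = {}
    for part in parts[1:]:
        if "=" not in part:
            continue  # skip malformed parameters
        key, _, val = part.partition("=")
        params[key.strip().lower()] = val.strip().strip('"')
    return base, params


def profile_by_year(records):
    """Count occurrences of each (year, base type, version) triple."""
    counts = Counter()
    for year, mime in records:
        base, params = parse_extended_mime(mime)
        counts[(year, base, params.get("version"))] += 1
    return counts


# Hypothetical sample records, not taken from the dataset.
sample = [
    (1998, 'text/html; version="3.2"'),
    (1998, 'text/html; version="3.2"'),
    (2005, 'application/pdf; version="1.4"; software="ExampleWriter 1.0"'),
]
profile = profile_by_year(sample)
```

Grouping on the year, base type and version is one simple way to obtain a temporal format profile; the paper itself may aggregate differently.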


Similar resources

The Lixto Project: Exploring New Frontiers of Web Data Extraction

The Lixto project is an ongoing research effort in the area of Web data extraction. Whereas the project originally started out with the idea to develop a logic-based extraction language and a tool to visually define extraction programs from sample Web pages, the scope of the project has been extended over time. Today, new issues such as employing learning algorithms for the definition of extrac...

Exploring 70 Years of the British National Health Service through Anniversary Documents

The British National Health Service (NHS) celebrates its 70th birthday on July 5, 2018. This article examines this anniversary through the lens of previous anniversaries. It examines seven documents close to each anniversary over a period of some 60 years, drawing on interpretive content analysis, based on the narrative dimensions of context (structure and finance); success or achievements; pro...

Exploring Entity-Centric Methods in the UK Government Web Archive

Being able to explore large digital collections effectively is of interest to both academics and practitioners alike. The need to go beyond the provision of keyword-driven functionality to features that support exploration and discovery is widely recognised. In addition, providers are seeking to support more diverse groups of users with varying information needs and tasks. Increasing amounts of...

Exploring Effective Advertising Strategies: The Roles of Formats, Content Relevance and Shopping Tasks on Ad Recognition

The widespread application of Web-based technology has contributed not only to the content of advertising but also to the improvement of presentation formats. Animation has become a powerful presentation format on the Web. Despite its potential benefits, however, animation is no panacea. Practitioners and academics have been paying increased attention to the exploration of effective advertising...

The WebDataCommons Microdata, RDFa and Microformat Dataset Series

In order to support web applications to understand the content of HTML pages an increasing number of websites have started to annotate structured data within their pages using markup formats such as Microdata, RDFa, Microformats. The annotations are used by Google, Yahoo!, Yandex, Bing and Facebook to enrich search results and to display entity descriptions within their applications. In this pa...

Journal title:
  • CoRR

Volume abs/1210.1714

Publication date 2012